Exploring Baseball Data with R
2026-02-26
“A picture is worth a thousand numbers.”
We will learn to turn baseball statistics into pictures — and learn to read what those pictures are saying.
The Lahman package contains season-by-season statistics for every MLB player since 1871:
We will focus on the Batting table
Each row = one player’s stats for one season with one team (a player who changed teams mid-season has one row per stint)
Key statistics: AB (at-bats), H (hits), HR (home runs), SO (strikeouts), BB (walks)
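A minimal sketch of loading the table and looking at the statistics listed above. It assumes the Lahman package is installed; the object name `batting` is only an illustrative choice.

```r
library(Lahman)   # install.packages("Lahman") if it is not installed yet

# Keep the identifier columns plus the five key statistics
batting <- Batting[, c("playerID", "yearID", "AB", "H", "HR", "SO", "BB")]

head(batting)           # first few player-season rows
nrow(batting)           # how many rows the table has
range(batting$yearID)   # first and last season covered
```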
A histogram shows how the values of one variable are spread out
Good for answering: how is one variable spread out?
Three things to notice:
1. Where is it tallest? Around 0–10 home runs — most regulars are not power hitters.
2. What is the shape? The bars fall off gradually to the right. This is called a right-skewed distribution.
3. How far does the tail reach? A very small number of players reach 50+ HRs. These are the elite power hitters.
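A histogram like the one described above could be drawn roughly as follows. This is a sketch, not necessarily the code behind the slide: the 400 at-bat cutoff for a "regular" and the bin width of 5 are assumptions.

```r
library(Lahman)
library(ggplot2)

# Assumption: a "regular" is a player-season with at least 400 at-bats
regulars <- subset(Batting, AB >= 400)

ggplot(regulars, aes(x = HR)) +
  geom_histogram(binwidth = 5, boundary = 0,
                 fill = "steelblue", colour = "white") +
  labs(x = "Home runs in a season",
       y = "Number of player-seasons",
       title = "Home runs per season, regular players")
```

Changing `binwidth` changes how smooth or jagged the picture looks, so it is worth trying a few values before reading the shape.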
A few extreme outliers pull the distribution to the right. Most players are near zero; a few are exceptional.
Values spread out roughly equally above and below the average. A shape that is balanced around its center like this is called symmetric; here it is also roughly bell-shaped.
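A quick numerical check of shape is to compare the mean with the median: in a right-skewed distribution the long tail pulls the mean above the median, while in a symmetric one the two sit close together. The sketch below assumes batting average (H / AB) as the symmetric example and a 400 at-bat cutoff; neither choice is stated on the slides.

```r
library(Lahman)

# Assumption: regulars are player-seasons with at least 400 at-bats
regulars <- subset(Batting, AB >= 400)
regulars$AVG <- regulars$H / regulars$AB   # batting average

# Right-skewed variable: expect the mean to sit above the median
hr_summary <- c(mean = mean(regulars$HR), median = median(regulars$HR))
hr_summary

# Roughly symmetric variable: expect mean and median to be close
avg_summary <- c(mean = mean(regulars$AVG), median = median(regulars$AVG))
avg_summary
```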
Notice the thicker right tail in the 2000s panel — the steroid era fingerprint.
A density plot is a smoothed histogram
Why use it?
Density plots are better than histograms for comparing two groups — overlapping curves are much easier to read than overlapping bars.
Steroid era (red):
Thicker far-right tail. A handful of players posted historically unprecedented 50–70 HR seasons.
Modern era (blue):
Higher in the 15–30 HR range — more players adopting power-oriented “launch-angle” swings, but fewer extreme outliers.
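Two overlapping density curves of this kind could be sketched as below. The year ranges for each era and the 400 at-bat cutoff are illustrative assumptions, since the slides do not give exact definitions.

```r
library(Lahman)
library(ggplot2)

# Assumed era definitions (illustrative only)
steroid_years <- 1995:2005
modern_years  <- 2015:2019

# Assumption: regulars are player-seasons with at least 400 at-bats
eras <- subset(Batting, AB >= 400 & yearID %in% c(steroid_years, modern_years))
eras$era <- ifelse(eras$yearID %in% steroid_years, "Steroid era", "Modern era")

era_colours <- c("Steroid era" = "red", "Modern era" = "blue")

ggplot(eras, aes(x = HR, colour = era, fill = era)) +
  geom_density(alpha = 0.3) +                 # semi-transparent so both curves show
  scale_colour_manual(values = era_colours) +
  scale_fill_manual(values = era_colours) +
  labs(x = "Home runs in a season", y = "Density",
       colour = "Era", fill = "Era")
```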
The curves have the same shape but the modern era (dark blue) sits noticeably to the left — the statistical signature of today’s strikeout-heavy, power-first game.
A scatterplot shows the relationship between two variables
Good for answering: how do two variables relate?
Direction: The cloud trends upward from left to right: more strikeouts tend to go with more home runs. Players who swing for the fences also miss more often.
Strength: The relationship is moderate — there is a clear trend but a lot of scatter. Strikeouts do not perfectly predict home runs.
Outliers: The grey box shows elite power hitters: high strikeout totals but also exceptional home run numbers.
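A scatterplot of strikeouts against home runs, with a grey box marking the high-strikeout, high-home-run corner, might look like this. The box limits (150 strikeouts, 40 home runs) and the 400 at-bat cutoff are hypothetical values chosen for illustration. The correlation coefficient at the end puts a number on direction and strength.

```r
library(Lahman)
library(ggplot2)

# Assumption: regulars have at least 400 at-bats; early seasons with
# missing strikeout counts are dropped
regulars <- subset(Batting, AB >= 400 & !is.na(SO))

ggplot(regulars, aes(x = SO, y = HR)) +
  # Hypothetical box limits for the "elite power hitter" corner
  annotate("rect", xmin = 150, xmax = Inf, ymin = 40, ymax = Inf,
           fill = "grey50", alpha = 0.2) +
  geom_point(alpha = 0.2) +
  labs(x = "Strikeouts in a season", y = "Home runs in a season")

# Positive value = upward trend; closer to 1 = tighter relationship
so_hr_cor <- cor(regulars$SO, regulars$HR)
so_hr_cor
```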
Rising section: Through most of the data, the trend goes up — power and strikeouts rise together.
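Flattening section: At very high strikeout totals (200+), the trend levels off. One reading is that too much swinging and missing becomes counterproductive even for power hitters, though very few player-seasons reach that range, so the trend there is less certain.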
Shaded band: This is the 95% confidence interval — the range of plausible trend lines. Wider band = less certainty.
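A smooth trend line with a 95% band can be added with `geom_smooth()`. The sketch below assumes a GAM smoother (which needs the mgcv package, normally shipped with R) and a 400 at-bat cutoff; the slides do not say which smoothing method was used.

```r
library(Lahman)
library(ggplot2)

# Assumption: regulars have at least 400 at-bats
regulars <- subset(Batting, AB >= 400 & !is.na(SO))

ggplot(regulars, aes(x = SO, y = HR)) +
  geom_point(alpha = 0.15) +
  # se = TRUE draws the shaded band; level sets its confidence level
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"),
              se = TRUE, level = 0.95) +
  labs(x = "Strikeouts in a season", y = "Home runs in a season")
```

The band widens wherever there are few points, which is why it fans out at the far right of the plot.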
Dark blue (walks often) clusters near the top of the plot — combining home run power with plate discipline.
These are the most dangerous hitters. Pitchers cannot simply avoid the strike zone (the batter will take a walk), so they must attack — and get punished when they make a mistake.
Red (rarely walks) spreads broadly — many free swingers strike out often without generating home runs in return.
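Colouring the points by how often a hitter walks could be sketched as follows. The walk rate here is approximated as BB / (AB + BB), and the split into three equal-sized groups is an assumption made for illustration.

```r
library(Lahman)
library(ggplot2)

# Assumption: regulars have at least 400 at-bats
regulars <- subset(Batting, AB >= 400 & !is.na(SO) & !is.na(BB))

# Approximate walk rate (ignores hit-by-pitch and sacrifices)
regulars$walk_rate <- regulars$BB / (regulars$AB + regulars$BB)

# Assumption: three groups split at the 1/3 and 2/3 quantiles
cuts <- quantile(regulars$walk_rate, probs = c(0, 1/3, 2/3, 1))
regulars$walk_group <- cut(regulars$walk_rate, breaks = cuts,
                           labels = c("Rarely walks", "Sometimes walks", "Walks often"),
                           include.lowest = TRUE)

ggplot(regulars, aes(x = SO, y = HR, colour = walk_group)) +
  geom_point(alpha = 0.4) +
  scale_colour_manual(values = c("Rarely walks" = "red",
                                 "Sometimes walks" = "grey60",
                                 "Walks often" = "darkblue")) +
  labs(x = "Strikeouts in a season", y = "Home runs in a season",
       colour = "Walk rate")
```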
A strikeout is, by definition, an at-bat with no hit — so more strikeouts directly pull batting average down. But there is still a lot of scatter around the trend.
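The arithmetic behind this: batting average is hits divided by at-bats, and a strikeout adds one to the at-bats without adding a hit. A small worked example with made-up numbers:

```r
# Hypothetical season: 150 hits in 500 at-bats
hits <- 150
ab   <- 500
avg_before <- hits / ab            # 150 / 500 = 0.300

# Same hits, plus 50 more at-bats that all end in strikeouts
extra_strikeouts <- 50
avg_after <- hits / (ab + extra_strikeouts)   # 150 / 550, about 0.273

round(c(before = avg_before, after = avg_after), 3)
```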
A trend line that slopes up = positive relationship. A trend line that slopes down = negative relationship. The slope shows direction, not strength: the more tightly the points cluster around the line, the stronger the relationship.
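Direction and strength can also be checked numerically. The sketch below uses simulated data (not baseball data): the sign of the fitted slope gives the direction, and the correlation coefficient summarises how tightly the points follow the line.

```r
set.seed(1)   # make the simulated noise reproducible

x <- seq(0, 200, length.out = 200)

# Two made-up datasets built around the same underlying line (slope 0.2),
# one with little scatter and one with a lot
y_tight <- 0.2 * x + rnorm(200, sd = 2)
y_loose <- 0.2 * x + rnorm(200, sd = 20)

# Direction: sign of the fitted slope
slope_tight <- unname(coef(lm(y_tight ~ x))[2])
slope_loose <- unname(coef(lm(y_loose ~ x))[2])

# Strength: correlation coefficient (closer to 1 or -1 = stronger)
cor_tight <- cor(x, y_tight)
cor_loose <- cor(x, y_loose)

c(slope_tight = slope_tight, slope_loose = slope_loose)
c(cor_tight = cor_tight, cor_loose = cor_loose)
```

Both fitted slopes come out positive and similar in size, while the correlation is much higher for the dataset with less scatter.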
Histogram
How is one variable spread out?
Density Plot
How do two groups compare?
Scatterplot
How do two variables relate?
More plot types to explore:
Companion HTML document contains full explanations and all R code.
Introduction to Graphical Methods | Baseball Data